Abstract. Several computational problems in phylogenetic reconstruction can be formulated
as restrictions of the following general problem: given a formula in conjunctive normal
form where the literals are rooted triples, is there a rooted binary tree that satisfies
the formula? If the formulas do not contain disjunctions, the problem becomes the famous
rooted triple consistency problem, which can be solved in polynomial time by an algorithm
of Aho, Sagiv, Szymanski, and Ullman. If the clauses in the formulas are restricted to
disjunctions of negated triples, Ng, Steel, and Wormald showed that the problem remains
NP-complete. We systematically study the computational complexity of the problem for
all such restrictions of the clauses in the input formula. For certain restricted disjunctions
of triples we present an algorithm that has sub-quadratic running time and is asymptotically
as fast as the fastest known algorithm for the rooted triple consistency problem.
We also show that any restriction of the general rooted phylogeny problem that does not
fall into our tractable class is NP-complete, using known results about the complexity of
Boolean constraint satisfaction problems. Finally, we present a pebble game argument
that shows that the rooted triple consistency problem (and also all generalizations studied
in this paper) cannot be solved by Datalog.



Rooted phylogeny problems are fundamental computational problems for phylogenetic
reconstruction in computational biology, and more generally in areas dealing with large
amounts of data about rooted trees. Given a collection of partial information about a rooted
tree, we would like to know whether there exists a single rooted tree that explains the data.
A concrete example of a computational problem in this context is the rooted triple
consistency problem. We are given a set V of variables, and a set of triples ab|c with
a, b, c ∈ V, and we would like to know whether there exists a rooted tree T with leaf set V
such that for each of the given triples ab|c the youngest common ancestor of a and b in this
tree is below the youngest common ancestor of a and c (if such a tree exists, we say that the
instance is satisfiable).

The rooted triple consistency problem has an interesting history. The first polynomial
time algorithm for the problem was discovered by Aho, Sagiv, Szymanski, and Ullman
[ASSU81], motivated by problems in database theory. This algorithm was later rediscovered
for phylogenetic analysis [Ste92]. Henzinger, King, and Warnow [HKW96] showed
how to use decremental graph connectivity algorithms to improve the quadratic runtime
O(mn) of the algorithm by Aho et al. to a deterministic algorithm with runtime O(m√n).
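
For orientation, the classical recursive idea behind the algorithm of Aho et al. can be
sketched as follows. This is only a minimal illustration of the simple quadratic-style
procedure, not the faster algorithm of [HKW96]. The function name `build`, the encoding of
a triple ab|c as the tuple `(a, b, c)`, and the nested-tuple output are assumptions made
for this sketch; it also assumes that every triple mentions three distinct leaves.

```python
def build(leaves, triples):
    """Return a rooted tree (nested tuples) on `leaves` that satisfies every
    triple (a, b, c), read as ab|c, or None if no such tree exists."""
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]
    leaf_set = set(leaves)
    # Only triples whose three leaves all lie in this subproblem are relevant.
    relevant = [t for t in triples if set(t) <= leaf_set]

    # Union-find: for every triple ab|c, the leaves a and b must stay together.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), []).append(x)
    if len(blocks) == 1:
        return None  # everything is forced together: the triples are inconsistent

    subtrees = []
    for block in blocks.values():
        sub = build(block, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


tree = build(["a", "b", "c"], [("a", "b", "c")])
conflict = build(["a", "b", "c"], [("a", "b", "c"), ("b", "c", "a")])
```

The returned tree may have vertices with more than two children; any binary refinement of
it satisfies the same triples, so this does not affect satisfiability.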

Dekker [Dek86] asked the question whether there is a finite set of 'rules' that allows
to infer a triple ab|c from another given set of triples Φ if all trees satisfying Φ also
satisfy ab|c. This question was answered negatively by Bryant and Steel [BS95]. Dekker's
'rules' have a very natural interpretation in terms of Datalog programs. Datalog as an
algorithmic tool for rooted phylogeny problems is more powerful than Dekker's rules. We say
that a Datalog program solves the rooted triple consistency problem if it derives a
distinguished 0-ary predicate false on a given set of triples if and only if the instance of
the rooted triple consistency problem is not satisfiable. One of the results of this paper is
the proof that there is no Datalog program that solves the rooted triple consistency problem.

Datalog inexpressibility results are known to be very difficult to obtain, and the few
existing results often exhibit interesting combinatorics [KV95, ASY91, FV99, Gro94, BK10].
The tool we apply to show our result, the existential pebble game, originates in finite model
theory, and was successfully applied to finite domain constraint satisfaction [KV98]. A
recent generalization of the intimate connection between Datalog and the existential pebble
game to a broad class of infinite domain constraint satisfaction problems [BD08] allows us
to apply the game to study the expressive power of Datalog for the rooted triple consistency
problem.

There are several other important rooted phylogeny problems. One is the subtree avoidance
problem, introduced by [NSW00]; another is the forbidden triple problem [Bry97]; both are
NP-hard. It turns out that all of those problems and many other rooted phylogeny problems
can be conveniently put into a common framework, which we introduce in this paper.

A rooted triple formula is a formula Φ in conjunctive normal form where all literals are
of the form ab|c. It turns out that the problems mentioned above and many other rooted
phylogeny problems (we provide more examples in Section 2) can be formalized as the
satisfiability problem for a given rooted triple formula Φ where the set of clauses that might
be used in Φ is (syntactically) restricted. If C is a class of clauses, and the input is confined
to rooted triple formulas with clauses from C, we call the corresponding computational
problem the rooted phylogeny problem for clauses from C.

In this paper, we determine for all classes of clauses C the computational complexity of
the rooted phylogeny problem for clauses from C. In all cases, the corresponding
computational problem is either in P or NP-complete. In our proof of the complexity
classification we apply known results from Boolean constraint satisfaction. The rooted
phylogeny problem is closely related to a corresponding split problem (defined in Section 4),
which is a Boolean constraint satisfaction problem where we are looking for a surjective
solution, i.e., a solution where at least one variable is set to true and at least one variable
is set to false. The complexity of Boolean split problems has been classified in [CKS01]. If C
is such that the corresponding split problem can be solved efficiently, our algorithmic results
in Section 4 show that the rooted phylogeny problem for clauses from C can be solved in
polynomial time. Conversely, we present a general reduction that shows that if the split
problem is NP-hard, then the rooted phylogeny problem for C is NP-hard as well.

2. Phylogeny Problems 

We fix some standard terminology concerning rooted trees. Let T be a tree (i.e., an
undirected, acyclic, and connected graph) with a distinguished vertex r, the root of T. The
vertices with exactly one neighbor in T are called leaves. The vertices of T are denoted by
V(T), and the leaves of T by L(T) ⊆ V(T). For u, v ∈ V(T), we say that u lies below v if
the path from u to r passes through v. We say that u lies strictly below v if u lies below
v and u ≠ v. The youngest common ancestor (yca) of two vertices u, v ∈ V(T) is the node
w such that both u and v lie below w and w has maximal distance from r. Note that the
yca, viewed as a binary operation, is commutative and associative, and hence there is a
canonical definition of the yca of a set of elements u_1, ..., u_k. The tree T is called binary
if the root has two neighbors, and every other vertex has either three neighbors or one
neighbor. A neighbor u of a vertex v is called a child of v (and v is called the parent of u)
in T if the distance of u from the root is strictly larger than the distance of v from the root.
We write uv|w (or say that uv|w holds in T) if u, v, w are distinct leaves of T and yca(u, v)
lies strictly below yca(u, w) in T. Note that for distinct leaves u, v, w of any binary tree T,
exactly one of the triples uv|w, uw|v, and vw|u holds in T.

Definition 2.1. A rooted triple formula is a (quantifier-free) conjunction of clauses (also
called triple clauses) where each clause is a disjunction of literals of the form xy|z.

Example 2.2. An example of a triple clause is xz|y ∨ yz|x; it will also be denoted by xy∤z.
Another example of a triple clause is xy|z_1 ∨ xy|z_2.

The following notion is used frequently in later sections. If Φ is a formula, and S is
a subset of the variables of Φ, then Φ[S] denotes the conjunction of all those clauses in Φ
that only contain variables from S.

Definition 2.3. A rooted triple formula Φ is satisfiable if there exists a rooted binary tree
T and a mapping α from the variables of Φ to the leaves of T such that in every clause
at least one literal is satisfied. A literal xy|z is satisfied by (T, α) if α(x), α(y), α(z) are
distinct and if yca(α(x), α(y)) lies strictly below yca(α(x), α(z)) in T. The pair (T, α) is
then called a solution to Φ.

We would like to remark that a rooted triple formula Φ is satisfiable if and only if there
exists a rooted binary tree T and an injective mapping α from the variables of Φ to the
leaves of T such that the formula evaluates under α to true.

Example 2.4. Let Φ = xz|y ∨ … ∧ … be a rooted triple formula with variables
V = {w, x, z, y}. Then the tree T

[Figure: a rooted binary tree T whose leaves are labeled by the variables in V]

together with the identity mapping on V is a solution to Φ.

A fundamental problem in phylogenetic reconstruction is the rooted triple consistency
problem [HKW96, BS95, Ste92, ASSU81]. This problem can be stated conveniently in terms
of rooted triple formulas.

Problem 2.5 (Rooted-Triple-Consistency).

INSTANCE: A rooted triple formula Φ without disjunction.

QUESTION: Is Φ satisfiable?

The following NP-complete problem was introduced and studied in an equivalent
formulation by Ng, Steel, and Wormald [NSW00].

Problem 2.6 (Subtree-Avoidance-Problem).

INSTANCE: A rooted triple formula Φ where each clause is of the form
x_1y_1∤z_1 ∨ ··· ∨ x_ky_k∤z_k. (As in Example 2.2, xy∤z stands for xz|y ∨ yz|x.)
QUESTION: Is Φ satisfiable?

Also the following problem is NP-hard; it has been studied in [Bry97].

Problem 2.7 (Forbidden-Triple-Consistency).

INSTANCE: A rooted triple formula Φ where each clause is of the form xy∤z.
QUESTION: Is Φ satisfiable?

More generally, if C is a class of triple clauses, the rooted phylogeny problem for clauses
from C is the following computational problem.

Problem 2.8 (Rooted-Phylogeny for clauses from C).

INSTANCE: A rooted triple formula Φ where each clause can be obtained from clauses in
C by substitution of variables.
QUESTION: Is Φ satisfiable?

All of these problems belong to NP. A given solution (T, α) can be verified in polynomial
time using the following deterministic algorithm. For each literal of each clause of Φ, check
whether the literal is satisfied. If at least one literal per clause is satisfied by (T, α),
then the given solution is valid; otherwise it is invalid. A literal ab|c is satisfied if α(a),
α(b), and α(c) are distinct and if v_1 = yca(α(a), α(b)) lies strictly below
v_2 = yca(α(a), α(c)) (recalling Definition 2.3). Determining the youngest common ancestor of
two vertices is straightforward using a bottom-up search for each vertex. Another search is
then used to check if v_1 lies strictly below v_2.
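
The verification procedure just described can be sketched in a few lines. This is a
minimal sketch under assumed encodings that are not part of the text: the tree is given as
a dictionary `parent` mapping each vertex to its parent (the root maps to `None`), the
mapping α is a dictionary `alpha`, and a formula is a list of clauses, each clause being a
list of triples `(x, y, z)` read as xy|z. All function names are hypothetical.

```python
def ancestors(parent, v):
    """Vertices on the path from v up to the root, v included (bottom-up search)."""
    path = [v]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def yca(parent, u, v):
    """Youngest common ancestor: the first vertex above v that is also above u."""
    above_u = set(ancestors(parent, u))
    for w in ancestors(parent, v):
        if w in above_u:
            return w
    raise ValueError("vertices do not belong to the same tree")


def strictly_below(parent, u, v):
    """True if u lies strictly below v."""
    return u != v and v in ancestors(parent, u)


def literal_satisfied(parent, alpha, literal):
    """Check a literal xy|z against the pair (T, alpha) as in Definition 2.3."""
    x, y, z = (alpha[t] for t in literal)
    if len({x, y, z}) < 3:
        return False
    return strictly_below(parent, yca(parent, x, y), yca(parent, x, z))


def is_solution(parent, alpha, formula):
    """Every clause must contain at least one satisfied literal."""
    return all(
        any(literal_satisfied(parent, alpha, lit) for lit in clause)
        for clause in formula
    )


# Tree with root r, inner vertex n, and leaves a, b (below n) and c (below r).
parent = {"r": None, "n": "r", "c": "r", "a": "n", "b": "n"}
alpha = {"a": "a", "b": "b", "c": "c"}
ok = is_solution(parent, alpha, [[("a", "b", "c")]])
bad = is_solution(parent, alpha, [[("a", "c", "b")]])
```

Each literal costs two upward walks plus one membership test, so the whole check is
polynomial in the size of the tree and the formula.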

Note that the rooted triple consistency problem, the subtree avoidance problem, and
the forbidden triple consistency problem are examples of rooted phylogeny problems, by
appropriately choosing the class C. For example, for the rooted triple consistency problem
we choose C = {xy|z}. The subtree avoidance problem is the rooted phylogeny problem for
the class C that contains for each k the clause x_1y_1∤z_1 ∨ ··· ∨ x_ky_k∤z_k.

Finally, note that when C contains clauses with literals of the form xx|y, xy|x, or xy|y,
then these literals can be removed from the clause since they are unsatisfiable. If all literals
in a triple clause are of this form, then the clause is unsatisfiable. It is clear that in instances
of the rooted phylogeny problem for clauses from a fixed class C one can efficiently decide
whether the input contains such clauses (in which case the input is unsatisfiable). Thus,
removing such clauses from C does not affect the complexity of the rooted phylogeny problem
for clauses from C. To avoid dealing with degenerate cases, we therefore adopt the convention
that no clause in C contains literals of the form xx|y, xy|x, or xy|y.
Constraint Satisfaction Problems. Many phylogeny problems can be viewed as infinite
domain constraint satisfaction problems (CSPs), which are defined as follows. Let Γ be
a structure¹ with a finite relational signature τ. A first-order formula over τ is called
primitive positive if it is of the form ∃x_1, ..., x_n. ψ_1 ∧ ··· ∧ ψ_m where ψ_1, ..., ψ_m are
atomic formulas over τ, i.e., of the form x = y or R(x_1, ..., x_k) for a k-ary R ∈ τ. Then the
constraint satisfaction problem for Γ, denoted by CSP(Γ), is the computational problem
to decide whether a given primitive positive sentence Φ (i.e., a primitive positive formula
without free variables) is true in Γ. The sentence Φ is also called an instance of CSP(Γ),
and the clauses of Φ are also called the constraints of Φ. We cannot give a full introduction
to constraint satisfaction and to constraint satisfaction on infinite domains, but point the
reader to [BJK05, Bod08]. Here, we only specify an infinite structure A that can be used
to describe the rooted triple consistency problem as a constraint satisfaction problem. It
will then be straightforward to see that all rooted phylogeny problems for clauses from a
finite class C can be formulated as infinite domain CSPs as well.

The signature of A is {|} where | is a ternary relation symbol. The domain of A is
ℕ → {0, 1}, i.e., the set of all infinite binary strings (hence, the domain of A is uncountable).
For two elements f, g of A, let lcp(f, g) be the set {1, ..., n} where n is the largest natural
number i such that f(j) = g(j) for all j ∈ {1, ..., i}; if no such i exists, we set lcp(f, g) := ∅,
and if f = g, we set lcp(f, g) := ℕ. The ternary relation fg|h in A holds on elements f, g, h
of A if they are pairwise distinct and |lcp(f, g)| > |lcp(f, h)|.
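
The relation fg|h on binary strings is easy to experiment with. The sketch below makes a
simplifying assumption that is not in the text: an infinite string is represented by a
finite Python string that is implicitly continued with zeros, so only eventually-zero
elements of the domain can be written down. The names `lcp_size` and `triple` are
hypothetical.

```python
import math
from itertools import zip_longest


def lcp_size(f, g):
    """|lcp(f, g)| for two binary strings that are implicitly padded with zeros.
    Returns math.inf when f and g denote the same infinite string."""
    n = 0
    for x, y in zip_longest(f, g, fillvalue="0"):
        if x != y:
            return n
        n += 1
    return math.inf  # identical after zero-padding, so the lcp is all of N


def triple(f, g, h):
    """The relation fg|h: pairwise distinct and |lcp(f, g)| > |lcp(f, h)|."""
    pairwise_distinct = (
        lcp_size(f, g) < math.inf
        and lcp_size(f, h) < math.inf
        and lcp_size(g, h) < math.inf
    )
    return pairwise_distinct and lcp_size(f, g) > lcp_size(f, h)


# "00" and "01" agree on one position; "1" differs from both at position one.
holds = triple("00", "01", "1")
fails = triple("00", "1", "01")
```

Here `"1"` and `"10"` denote the same element of the domain, which is why equality is
tested through `lcp_size` rather than through Python string equality.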

The following proposition shows that instances of the rooted triple consistency problem can
be viewed as primitive positive formulas over the signature {|}.

Proposition 2.9. A rooted triple formula Φ(x_1, ..., x_n) is satisfiable if and only if the
sentence ∃x_1, ..., x_n. Φ(x_1, ..., x_n) is true in A.

Proof. Suppose that ∃x_1, ..., x_n. Φ(x_1, ..., x_n) is true in A, and let
f_1, ..., f_n : ℕ → {0, 1} be witnesses for x_1, ..., x_n that satisfy Φ in A. We define a
finite rooted tree T as follows. The vertex set of T consists of the restrictions of f_i to
lcp(f_i, f_j) for all 1 ≤ i, j ≤ n (we do not require i and j to be distinct). Vertex g is
above vertex g' in T if g' extends g; it is clear that this describes T uniquely. Note that
f_1, ..., f_n are exactly the leaves of T, and that T is binary. Let α be the map that sends
x_i to f_i. Then (T, α) satisfies Φ.

Conversely, let (T, α) be a solution to Φ. For each vertex v of T that is not a leaf, let
l(v) and r(v) be the two neighbors of v in T that have larger distance from the root than v.
Let h be the length of the path r = p_1, ..., p_h = α(x_i) from the root r to α(x_i) in T.
Define f_i : ℕ → {0, 1} by setting f_i(j) = 0 if p_{j+1} = l(p_j), 1 ≤ j < h, and f_i(j) = 1
otherwise. Clearly, the elements f_1, ..., f_n of A show that
∃x_1, ..., x_n. Φ(x_1, ..., x_n) is true in A. □

This shows that the rooted triple consistency problem is indeed a constraint satisfaction
problem. A refined version of this observation will be useful in Section 3 to apply known
techniques for proving Datalog inexpressibility of the rooted triple consistency problem.

A triple clause is called trivial if the clause is satisfied by any injective mapping from the
variables into the leaves of any rooted tree. The following lemma (Lemma 2.10) shows that
the rooted triple consistency problem is among the simplest rooted phylogeny problems,
that is, for every class C that contains a non-trivial triple clause the rooted phylogeny
problem for C can simulate the rooted triple consistency problem in a simple way.



¹ We follow standard terminology in logic, see e.g. [Hod93].
Lemma 2.10. Let φ(x_1, ..., x_k) be a non-trivial triple clause. Then there are variables
y^1_1, ..., y^1_k, ..., y^l_1, ..., y^l_k ∈ {a, b, c} such that
⋀_{i=1}^{l} φ(y^i_1, ..., y^i_k) is logically equivalent to ab|c.

Proof. First observe that if k = 3 and if φ(x_1, x_2, x_3) contains only one literal then
renaming its variables is trivial. Second, if φ(x_1, x_2, x_3) = x_1x_3|x_2 ∨ x_2x_3|x_1, then
φ(a, c, b) ∧ φ(b, c, a) is logically equivalent to ab|c. If φ(x_1, x_2, x_3) contains three or
more literals, then due to its non-triviality there can only be at most two distinct literals.
Thus, we fall back to one of the already shown cases and the claim follows for all clauses
with exactly three variables.

If k > 3, then non-triviality of φ implies that φ(x_1, ..., x_k) can be written as
x_{i_1}x_{i_2}|x_{i_3} ∨ φ'(x_1, ..., x_k) for distinct variables x_{i_1}, x_{i_2}, x_{i_3} such
that φ' does not imply x_{i_1}x_{i_2}|x_{i_3}, or as x_{i_1}x_{i_2}∤x_{i_3} ∨ φ'(x_1, ..., x_k)
for distinct variables x_{i_1}, x_{i_2}, x_{i_3} such that φ' does not imply
x_{i_1}x_{i_2}∤x_{i_3}. In both cases we can falsify all literals in φ' that contain a variable
x_{i_4} distinct from x_{i_1}, x_{i_2}, x_{i_3} by making x_{i_4} equal to some other variable
in this literal. The claim then follows from the case k = 3. □
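
The second case of the proof for k = 3 can be checked mechanically. The following sketch
assumes that the clause in question reads φ(x_1, x_2, x_3) = x_1x_3|x_2 ∨ x_2x_3|x_1 and
only considers injective assignments: for three distinct leaves of a binary tree, exactly
one pair branches together, so there are only three cases to enumerate. The encoding of a
case by its "cherry" (the pair that branches together) and all function names are
assumptions of this sketch.

```python
from itertools import combinations

LEAVES = ("a", "b", "c")


def holds(triple, cherry):
    """xy|z holds exactly when {x, y} is the pair that branches together."""
    x, y, z = triple
    return len({x, y, z}) == 3 and {x, y} == set(cherry)


def phi(x1, x2, x3, cherry):
    """The clause x1x3|x2 OR x2x3|x1, evaluated in the tree shape `cherry`."""
    return holds((x1, x3, x2), cherry) or holds((x2, x3, x1), cherry)


def equivalent_on_all_shapes():
    """Compare phi(a, c, b) AND phi(b, c, a) with ab|c on all three tree shapes."""
    for cherry in combinations(LEAVES, 2):
        conjunction = phi("a", "c", "b", cherry) and phi("b", "c", "a", cherry)
        if conjunction != holds(("a", "b", "c"), cherry):
            return False
    return True


result = equivalent_on_all_shapes()
```

The first conjunct excludes the shape ac|b and the second excludes bc|a, which leaves ab|c
as the only remaining possibility.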

This implies that the Datalog inexpressibility result for the rooted triple consistency 
problem we present in the next section applies to all the rooted phylogeny problems for 
clauses from C that contain a non-trivial clause. 



3. Datalog 

Datalog is an important algorithmic concept originating both in logic programming and in
database theory [AHV95, EF99, Imm98]. Feder and Vardi [FV99] observed that Datalog
programs can be used to formalize efficient constraint propagation algorithms used in
Artificial Intelligence [All83, Mon74, Dec92, Mac77]. Such algorithms have also been studied
for the phylogenetic reconstruction problem. Dekker [Dek86] studied rules that infer rooted
triples from given sets of rooted triples, and asked whether there exists a set of rules such
that a rooted triple can be derived by these rules from a set of rooted triples Φ if and
only if it is logically implied by Φ. This question was answered negatively by Bryant and
Steel [BS95].

In this section, we show the stronger result that the rooted triple consistency problem 
cannot be solved by Datalog. This is a considerable strengthening of this previous result by 
Bryant and Steel, since we can use Datalog programs not only to infer rooted triples that 
are implied by other rooted triples, but rather might use Datalog rules to infer an arbitrary 
number of relations (aka IDBs) of arbitrary arity to solve the problem. Moreover, we only 
require that the Datalog program derives false if and only if the instance is unsatisfiable. 
In particular, we do not require that the Datalog program derives every rooted triple that 
is logically implied by the instance (which is required for the question posed by Dekker). 
Finally, as already announced in the conference version of this paper, we show that the 
proof technique extends to other constraint formalisms for reasoning about trees. 

In our proof, we use a pebble game that was introduced to describe the expressive power
of Datalog [KV95] and which was later used to study Datalog as a tool for finite domain
constraint satisfaction problems [FV99]. The correspondence between Datalog and pebble
games extends to infinite domain constraint satisfaction problems for countably infinite
ω-categorical structures. A countably infinite structure is called ω-categorical if its
first-order theory² has exactly one countable model up to isomorphism. It can be seen (e.g.
using the theorem of Ryll-Nardzewski, see [Hod93]) that the structure A introduced in
Section 2 is, unfortunately, not ω-categorical. However, there are several ways of defining
an ω-categorical structure A (described also in [Cam90]) which has the same constraint
satisfaction problem.

² The first-order theory of a structure is the set of first-order sentences that are true in the structure.

We exactly follow the axiomatic approach to define such a structure A given in [AN98].
A ternary relation C is said to be a C-relation on a set L if for all a, b, c, d ∈ L the following
conditions hold:

(C1) C(a; b, c) → C(a; c, b);

(C2) C(a; b, c) → ¬C(b; a, c);

(C3) C(a; b, c) → C(a; d, c) ∨ C(d; b, c);

(C4) a ≠ b → C(a; b, b).

A C-relation is called dense if it satisfies

(C7) C(a; b, c) → ∃e. (C(e; b, c) ∧ C(a; b, e)).

The structure (L; C) is also called a C-set.
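
To get a feeling for the axioms, one can test them by brute force on a small finite
example. The sketch below assumes the usual tree-induced reading of a C-relation, which is
an illustration chosen here and not a construction taken from [AN98]: the points are the
binary strings of length 3 (the leaves of a complete binary tree), and C(a; b, c) holds
when b and c share a strictly longer common prefix than a and b. All names are
hypothetical.

```python
import math
from itertools import product


def lcp(f, g):
    """Length of the longest common prefix of two equal-length strings;
    infinite when f == g."""
    if f == g:
        return math.inf
    n = 0
    while f[n] == g[n]:
        n += 1
    return n


def C(a, b, c):
    """Assumed tree-induced C-relation: b and c branch together, away from a."""
    return lcp(b, c) > lcp(a, b)


def implies(p, q):
    return (not p) or q


def check_axioms(points):
    """Brute-force check of (C1)-(C4) over all quadruples of points."""
    for a, b, c, d in product(points, repeat=4):
        if not implies(C(a, b, c), C(a, c, b)):                # (C1)
            return False
        if not implies(C(a, b, c), not C(b, a, c)):            # (C2)
            return False
        if not implies(C(a, b, c), C(a, d, c) or C(d, b, c)):  # (C3)
            return False
        if not implies(a != b, C(a, b, b)):                    # (C4)
            return False
    return True


L = ["".join(bits) for bits in product("01", repeat=3)]
axioms_hold = check_axioms(L)
```

Density (C7) cannot hold on a finite set such as this one, since it would require ever
longer common prefixes; it is a property of the infinite structures considered in the text.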

A structure Γ is called k-transitive if for any two k-tuples (a_1, ..., a_k) and (b_1, ..., b_k)
of distinct elements of Γ there is an automorphism³ of Γ that maps a_i to b_i for all i ≤ k. A
structure Γ is said to be relatively k-transitive if for every partial isomorphism f between
induced substructures of Γ of size k there exists an automorphism of Γ that extends f. Note
that a relatively 3-transitive C-set is necessarily 2-transitive.

Theorem 3.1 (Theorem 14.7 in [AN98]). Let (L; C) be a relatively 3-transitive C-set. Then
(L; C) is ω-categorical.

Theorems 11.2 and 11.3 in [AN98] show how to construct such a C-relation from a
semi-linear order⁴ that is dense, normal, and branches everywhere (all these concepts are
defined in [AN98]). Such a semi-linear order is explicitly constructed in Section 5 of [AN98].

In fact, there is, up to isomorphism, a unique relatively 3-transitive countable C-set
which

• is uniform with branching number 2, that is, for all a, b, c ∈ L we have C(a; b, c) ∨
C(b; c, a) ∨ C(c; a, b),

• is dense, and

• satisfies ¬C(a; a, a) for all a ∈ L.

(See the comments in [AN98] after the statement of Theorem 14.7; the condition that
¬C(a; a, a) for all (equivalently, for some) a ∈ L has been forgotten there, but is necessary
to obtain uniqueness.)

In the following, let A be the structure whose domain is the domain of the dense C-set
that is uniform with branching number 2; the signature of A is not the C-relation, but the
relation xy|z defined from the C-relation by

xy|z ⇔ C(z; x, y) ∧ x ≠ y ∧ y ≠ z ∧ x ≠ z.

Structures that are first-order definable in ω-categorical structures are ω-categorical
(Theorem 7.3.8 in [Hod93]), so in particular A is ω-categorical. Note that the relation | of A
satisfies (C1), (C2), (C3), but not (C4).

³ An automorphism of a structure Γ is an isomorphism between Γ and itself.

⁴ A poset is connected if for any two a, b there exists a c such that a ≤ c and b ≤ c, or a ≥ c and b ≥ c.
A connected poset is called semi-linear if for every point, the set of all points above it is linearly ordered.
The following observation has already been made in [Bod08], but without proof, so we
provide a proof here.

Proposition 3.2. A rooted triple formula Φ(x_1, ..., x_n) is satisfiable if and only if the
sentence ∃x_1, ..., x_n. Φ(x_1, ..., x_n) is true in A.

Proof. Suppose that there are a_1, ..., a_n such that Φ(a_1, ..., a_n) is true in A. We first
define a binary relation ⪯ on the set of all pairs (a, b) with a, b ∈ {a_1, ..., a_n}. We set
(a, b) ⪯ (c, d) if ¬cd|a ∧ ¬cd|b, and define R := {(u, v) | u ⪯ v ∧ v ⪯ u}.

Lemma 3.3 (Lemma 12.1 in [AN98]). The relation ⪯ is a preorder, and hence R is an
equivalence relation.

Also the following is taken from [AN98]; but to avoid extensive references into the
proofs there, we give a self-contained presentation here. We claim that the poset ⪯/R
that is induced by ⪯ in the natural way on the equivalence classes of R is semi-linear. To
see this, let (a_1, a_2), (b_1, b_2), (c_1, c_2) be such that (a_1, a_2) ⪯ (b_1, b_2) and
(a_1, a_2) ⪯ (c_1, c_2). We have to show that (b_1, b_2) and (c_1, c_2) are comparable in ⪯.
If (b_1, b_2) ⋠ (c_1, c_2), then c_1c_2|b_1 or c_1c_2|b_2. Suppose in the following that
c_1c_2|b_1; the case c_1c_2|b_2 is analogous. Since (a_1, a_2) ⪯ (c_1, c_2) we have in
particular ¬c_1c_2|a_1 in A. Recall that the relation | satisfies (C3), which can be
equivalently written as ∀a, b, c, d. (C(a; b, c) ∧ ¬C(d; b, c)) → C(a; d, c), so we find that
a_1c_1|b_1. By (C2) we have ¬a_1b_1|c_1. Since (a_1, a_2) ⪯ (b_1, b_2) we have ¬b_1b_2|a_1.
Axiom (C3) can also be written as ∀a, b, c, d. (¬C(a; d, c) ∧ ¬C(d; b, c)) → ¬C(a; b, c), and
thus ¬b_1b_2|c_1. Similarly, ¬b_1b_2|c_2. Therefore, (c_1, c_2) ⪯ (b_1, b_2), which is what we
had to show.

Next, note that when (d_1, d_2) and (e_1, e_2) are incomparable with respect to ⪯, then
(d_1, e_1) is an upper bound for (d_1, d_2) and (e_1, e_2), that is, (d_1, d_2) ⪯ (d_1, e_1) and
(e_1, e_2) ⪯ (d_1, e_1). It follows that ⪯/R is indeed a semi-linear order with a smallest
element r, and there exists a tree T on the equivalence classes of R such that p lies below q
in T if for all (equivalently, for some) (a, b) ∈ p and (c, d) ∈ q we have (c, d) ⪯ (a, b). Let α
be the map that sends x_i to the equivalence class of (a_i, a_i); it is straightforward to
verify that (T, α) satisfies Φ.

Conversely, let (T, α) be a solution to Φ. We now determine elements a_1, ..., a_n from
A, and prove by induction on i that α(x_r)α(x_s)|α(x_t) in T if and only if a_ra_s|a_t in A,
for all r, s, t ≤ i. This is trivial for n = i = 1, and for n = i = 2 we can choose arbitrary
distinct elements a_1 and a_2 from A. Now suppose we have already found elements
a_1, ..., a_i of A, for 2 ≤ i < n, that satisfy the inductive hypothesis. Let v be the vertex in
T that has the maximal distance from the root of T such that there is a j ≤ i where both
α(x_j) and α(x_{i+1}) lie strictly below v.

First consider the case that v is the root of T. Then we can choose k, l ∈ {1, ..., i}
such that v = yca(α(x_k), α(x_l)). Let a be an element of A that is distinct from a_k and a_l;
by the properties of A (xy|z is uniform with branching number 2) we have that a_ka_l|a,
a_ka|a_l, or aa_k|a_l holds. In the first case, we set a_{i+1} to a. In the second case, by
relative 3-transitivity of A there exists an automorphism β of A that maps … to a_l and that
fixes a. In this case we set a_{i+1} to β(a_l). In the third case we proceed similarly to the
second. In all three cases we have a_pa_q|a_{i+1} for all p, q ≤ i, which proves the inductive
step.

Next, consider the case that v is not the root of T. In this case, there must be an
m ≤ i such that α(x_j)α(x_{i+1})|α(x_m); choose m such that the distance between the root
and yca(α(x_j), α(x_m)) is maximal. When j is the only index of size at most i such that
α(x_j) lies below v in T, then density of A (axiom (C7) in the special case that b = c)
implies that there is an a such that a_ja|a_m. We can then set a_{i+1} to a. Otherwise, there
are j', j'' ≤ i such that α(x_{j'})α(x_{j''})|α(x_{i+1}); choose them such that the distance
between v and yca(α(x_{j'}), α(x_{j''})) is minimal. Again we apply density (axiom (C7)) and
conclude that there is an a such that a_{j'}a_{j''}|a and a_{j'}a|a_m. We can then set a_{i+1}
to a. □

The Existential Pebble Game. The fact that A is ω-categorical allows us to use the
existential k-pebble game to establish the Datalog lower bound for the rooted triple
consistency problem [BD08].

The existential k-pebble game (for a structure Γ) is played by the players Spoiler and
Duplicator on an instance Φ of CSP(Γ) and Γ. Each player has k pebbles, p_1, ..., p_k for
Spoiler and q_1, ..., q_k for Duplicator; we say that q_i corresponds to p_i. Spoiler places
his pebbles on the variables of Φ, Duplicator her pebbles on elements of Γ. Initially, none
of the pebbles is placed. In each round of the game Spoiler picks some of his pebbles. If
some of these pebbles are already placed on Φ, then Spoiler removes them from Φ, and
Duplicator responds by removing the corresponding pebbles from Γ. Spoiler then places the
picked pebbles on variables of Φ, and Duplicator responds by placing the corresponding
pebbles on elements of Γ. Duplicator loses if at some point of the game

• there is a clause R(x_1, ..., x_k) in Φ such that x_1, ..., x_k are pebbled by
p_{j_1}, ..., p_{j_k}, and

• the corresponding pebbles q_{j_1}, ..., q_{j_k} of Duplicator are placed on elements
b_1, ..., b_k in Γ such that R(b_1, ..., b_k) does not hold in Γ.

Duplicator wins if the game continues forever. We will make use of the following theorem
from [BD08].

Theorem 3.4 (Theorem 5 in [BD08]). Let Γ be an ω-categorical (or finite) structure. Then
there is no Datalog program that solves CSP(Γ) if and only if for every k there exists an
unsatisfiable instance Φ_k of CSP(Γ) such that Duplicator wins the existential k-pebble game
on Φ_k and Γ.



Our Method. The incidence graph G(Φ) of an instance Φ of CSP(Γ) is the (undirected,
simple) bipartite graph whose vertex set is the disjoint union of the variables of Φ and the
clauses of Φ. An edge joins a variable a and a clause φ of Φ when a appears in φ. A leaf of
Φ is a variable that has degree one in G(Φ). An instance has girth k if the shortest cycle of
its incidence graph has 2k edges⁵.
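
The incidence graph and the girth of an instance can be computed directly from these
definitions. In the sketch below, an instance is represented only by the variable scopes of
its clauses (a list of tuples of variable names); this encoding and the function names are
assumptions made for illustration. The shortest cycle is found with a breadth-first search
from every vertex, which is a standard technique and not specific to this paper.

```python
from collections import deque


def incidence_graph(clauses):
    """Adjacency sets of the bipartite incidence graph.
    Variables become ("var", name); clauses become ("clause", index)."""
    adj = {}
    for i, clause in enumerate(clauses):
        c = ("clause", i)
        adj.setdefault(c, set())
        for x in clause:
            v = ("var", x)
            adj.setdefault(v, set()).add(c)
            adj[c].add(v)
    return adj


def shortest_cycle(adj):
    """Number of edges of a shortest cycle, or None if the graph is a forest."""
    best = None
    for root in adj:
        dist = {root: 0}
        par = {root: None}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    par[w] = u
                    queue.append(w)
                elif par[u] != w:
                    # Non-tree edge: it closes a walk through the BFS root.
                    length = dist[u] + dist[w] + 1
                    if best is None or length < best:
                        best = length
    return best


def girth(clauses):
    """Girth in the sense of the text: half the length of a shortest cycle."""
    cycle = shortest_cycle(incidence_graph(clauses))
    return None if cycle is None else cycle // 2


def leaves(clauses):
    """Variables of degree one in the incidence graph."""
    adj = incidence_graph(clauses)
    return {v[1] for v in adj if v[0] == "var" and len(adj[v]) == 1}


triangle = [("a", "b", "x"), ("b", "c", "y"), ("c", "a", "z")]
triangle_girth = girth(triangle)
```

Because the incidence graph is bipartite, every cycle has even length, so the integer
division by two is exact.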

Lemma 3.5. Let Γ be an l-transitive (for l ≥ 1) ω-categorical (or finite) structure with
relations of arity at most l + 1. Suppose that for every k there exists an unsatisfiable
instance Φ_k of girth at least k where every constraint has an injective satisfying assignment.
Then CSP(Γ) cannot be solved by Datalog.

We will see examples for l = 1 and for l = 2 in this paper. Note that by 1-transitivity,
every unary relation in Γ either denotes the empty set or the full domain of Γ. Since Φ_k only
contains satisfiable constraints, all unary constraints in Φ_k are satisfied by every mapping
to Γ. So we make in the following the assumption that Φ_k does not contain unary constraints.

In the proof we use the following concept, inspired by a Datalog inexpressibility result 
that was established for temporal reasoning [BK10]. 



⁵ If we view instances in the obvious way as structures rather than formulas, our definition of girth
corresponds to the standard definition of girth in graph theory.

Figure 1: A situation in the proof of Lemma 3.5. Spoiler just pebbled a, Duplicator is next.



Definition 3.6. Let Φ be an instance of girth at least k + 1. Then a subset S of at least 2
and at most k variables of Φ is called dominated if G_S := G(Φ[S]) is connected (and hence
a tree), and if all but at most one of the leaves of G_S are pebbled.

The notion of dominated sets allows us to specify a winning strategy for Duplicator for
the existential k-pebble game.

Proof of Lemma 3.5. To apply Theorem 3.4 we have to prove that Duplicator wins the
existential k-pebble game on Φ_k and Γ.

Suppose that in the course of the game, u is an unpebbled leaf of a dominated set S
with pebbled leaves a_1, ..., a_l, and let b_1, ..., b_l be the corresponding responses of
Duplicator. Duplicator will play in such a way that b_1, ..., b_l are pairwise distinct.
Moreover, Duplicator always maintains the following invariant. Whenever Spoiler places a
pebble on a_{l+1}, Duplicator can play a value from Γ such that the mapping that assigns a_i
to b_i for 1 ≤ i ≤ l + 1 can be extended to all of S such that this extension is a satisfying
assignment for Φ_k[S].

The invariant is satisfied at the beginning of the game: when Spoiler places a pebble
on a_1, Duplicator can play any value b_1, which is a legal move by our assumption that
Φ_k does not contain unary constraints.

Suppose that during the game Spoiler pebbles a variable a. Let S_1, ..., S_p be the
dominated sets where a is the unpebbled leaf before Spoiler puts his pebble on a. (If there
is no such dominated set, then p = 0.) Let T_1, ..., T_q be the newly created dominated sets
after Spoiler put his pebble on a. Note that since each T_j has not been a dominated set
before Spoiler put his pebble on a, it must contain one unpebbled leaf distinct from a, which
we denote by r_j. For an illustration, see Figure 1.

We have to show that under the assumption that Duplicator in her previous moves
has always maintained the invariant, she will be able to make a move that again fulfills the
invariant. If $p > 0$, then the union $S$ of the sets $S_1, \dots, S_p$ was itself a dominated set already
before Spoiler played on $a$, since $G_S$ is clearly connected (all the $S_i$ share the vertex $a$) and
no unpebbled leaves can be created by taking a union of dominated sets. The next move
of Duplicator is the value $b$ from the invariant applied to $S$. This preserves the invariant,
since for every $i \leq q$, the set $T_i \cup S$ has been a dominated set already before Spoiler played
on $a$: because $T_i$ and $S$ share the vertex $a$, the graph $G_{S \cup T_i}$ is connected, and since $a$ is not
a leaf in $G_{S \cup T_i}$, the only unpebbled leaf of $G_{S \cup T_i}$ is $r_i$. Therefore, $\alpha$ can be extended to all
of $T_i$.

Figure 2: A situation in the proof of Lemma 3.5, extending $\alpha$ to all of $T_j$.

If $p = 0$, Duplicator plays an arbitrary element $b$ in $\Gamma$. We prove by induction on the
size of $T_j$ that $\alpha$ can be extended to $T_j$ such that $\alpha(a) = b$. We can assume that only leaves
in $G_{T_j}$ are pebbled (otherwise, since $G_{T_j}$ is a tree, the task reduces to proving the statement
for proper subsets of $T_j$). Consider a clause $\phi$ of $\Phi_k[T_j]$ that contains $a$, and let $V$ be the
variables of $\phi$. This clause must be unique: otherwise, the graph obtained from $G(\Phi_k)$ by
removing the vertex $a$ has at least two components. Only one of those components can
contain $r_j$; the other component must then be a dominated set where all leaves are pebbled,
a contradiction to the assumption that $p = 0$. Now consider the graph $H$ obtained from
$G_{T_j}$ by removing the vertex that corresponds to $\phi$. See Figure 2.

If one of the connected components of $H$, say $C$, forms a dominated set, then the unique
variable $v$ in $C \cap V$ (uniqueness again follows from the fact that $G_{T_j}$ is a tree) is the unique
unpebbled leaf of $C$, and by the invariant of Duplicator's strategy $\alpha$ can be extended to $\alpha'$
that is defined on all of $C$ such that it satisfies $\Phi_k[C]$. Hence, by removing the pebbles from
$C$ and adding a pebble on $v$, with $\alpha'(v)$ the corresponding response of Duplicator, we can
apply the inductive assumption to $(T_j \setminus C) \cup \{v\}$ to find an extension of $\alpha$ that is a satisfying
assignment for $\Phi_k[T_j]$ and maps $a$ to $b$.

Otherwise, all variables in $V$ except for the variable $v$ that lies in the connected component
of $r_j$ in $H$ are pebbled. By our assumption on the signature, the clause $\phi$ contains at most
$l$ pebbled variables (including $a$). Also by assumption there exists an injective mapping
$\beta : V \to \Gamma$ that satisfies $\phi$. Since $\Gamma$ is $l$-transitive, there is an automorphism $\gamma$ of $\Gamma$ that
maps $\beta(a)$ to $b$ and that sends $\beta(w)$ to $\alpha(w)$, for $w \in V \setminus \{v\}$. Then we extend $\alpha$ to $v$ by
$\alpha(v) := \gamma(\beta(v))$; the extension clearly satisfies $\phi$. Now we repeat the argument with $v$ in
place of $a$, and $\alpha(v)$ in place of $b$, and are done by inductive assumption. □

Application to the Rooted Phylogeny Problem. We now turn back to the rooted
triple consistency problem, CSP($\Lambda$). The structure $\Lambda$ is 2-transitive and the only relation
has arity three, and hence we can apply Lemma 3.5 to prove that CSP($\Lambda$) cannot be solved
by Datalog.
To construct an unsatisfiable girth $k$ instance $\Phi_k$ for CSP($\Lambda$), let $G$ be a cubic graph
of girth at least $k$ that has a Hamiltonian cycle. Such a graph exists; see e.g. the comments
after the proof of Theorem 3.2 in [Big98]. Note that $G$ must have an even number of
vertices. Let $H = (v_1, v_2, \dots, v_n)$ be the Hamilton cycle of $G$. For any vertex $a$ of $G$, let
$r(a)$ be the vertex that precedes $a$ on $H$, $s(a)$ the vertex that follows $a$ on $H$, and $t(a)$ the
third remaining neighbor of $a$ in $G$.

We now define $\Phi_k$. The vertices of $G$ will be the variables of $\Phi_k$. Then

$$\Phi_k := \bigwedge_{a \in V(G)} r(a)s(a)|t(a) .$$

Consider the graph on the variables of $\Phi_k$ that has an edge $ab$ when $\Phi_k$ contains a triple
clause $ab|c$ for some variable $c$ of $\Phi_k$. This graph is connected, since it actually equals the
Hamilton cycle $H$ of $G$. Hence, a condition due to Aho et al. [ASSU81] implies that $\Phi_k$ is
unsatisfiable for all $k \geq 1$. This can also be seen by Lemma 4.3 in Section 4. It is clear
that every triple clause of $\Phi_k$ has an injective satisfying assignment. So the only remaining
condition to apply Lemma 3.5 is the verification that $\Phi_k$ has girth $k$. But this is obvious
since any cycle of length $2l < 2k$ in the incidence graph $G(\Phi_k)$ would give rise to a cycle of
length $l < k$ in $G$, in contradiction to $G$ having girth $k$.

Corollary 3.7. There is no Datalog program that solves the rooted triple consistency
problem.



Other Applications of the Technique. Our technique to show Datalog inexpressibility
can be adapted to show that the following (closely related) problems cannot be solved by
Datalog either.

• Satisfiability of branching time constraints [BJ03];

• The network consistency problem of the left-linear-point algebra [Due05, Hir97];

• Cornell's tree description logic [Cor94, BK07].

All three of these problems contain the following computational problem as a special case.

Problem 3.8 (Tree-Description-Consistency).

INSTANCE: A finite structure $(V; <, \|)$ where $<$ and $\|$ are binary relations.
QUESTION: Is there a rooted tree $T$ and $\alpha : V \to V(T)$ such that if $x < y$ then $\alpha(y)$ lies
          strictly below $\alpha(x)$ in $T$, and if $x \| y$ then neither $\alpha(x)$ lies below $\alpha(y)$ nor
          $\alpha(y)$ lies below $\alpha(x)$ in $T$?



To again apply Lemma 3.5, we first have to show that Tree-Description-Consistency can
be formulated as a CSP for a transitive $\omega$-categorical structure $\Omega = (D; <, \|)$; this has
already been observed in [BN06]. This time, it is more convenient to directly construct $\Omega$.
The domain $D$ consists of the set of all non-empty finite sequences of rational numbers.
For $a = (q_1, q_2, \dots, q_n)$, $b = (q'_1, q'_2, \dots, q'_m)$, $n \leq m$, we write $a < b$ if one of the following
conditions holds:

• $a$ is a proper initial subsequence of $b$, i.e., $n < m$ and $q_i = q'_i$ for $1 \leq i \leq n$;

• $q_i = q'_i$ for $1 \leq i < n$, and $q_n < q'_n$.

The relation $\|$ is the set of all unordered pairs of distinct elements that are incomparable
with respect to $<$. A proof that $\Omega$ is indeed 1-transitive and $\omega$-categorical can be found
in [AN98] (Section 5). Since the signature is binary, we can again apply Lemma 3.5, and
have to find unsatisfiable instances of arbitrarily high girth.
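To make the two defining conditions concrete, the following Python sketch implements them for sequences given as tuples of numbers. The function names `less` and `incomparable` are illustrative and do not appear in the text, and treating $a < b$ as false whenever $n > m$ is an assumption based on the requirement $n \leq m$ in the definition.

```python
def less(a, b):
    """Return True if a < b under the two conditions of the definition.

    Assumption: the relation is stated only for len(a) <= len(b),
    so a < b is taken to be false whenever a is longer than b.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("the domain contains only non-empty sequences")
    if n > m:
        return False
    # Condition 1: a is a proper initial subsequence of b.
    if n < m and tuple(b[:n]) == tuple(a):
        return True
    # Condition 2: agreement on the first n-1 entries, strictly smaller n-th entry.
    return tuple(a[:n - 1]) == tuple(b[:n - 1]) and a[n - 1] < b[n - 1]


def incomparable(a, b):
    """Distinct elements that are unrelated by < in either direction."""
    return tuple(a) != tuple(b) and not less(a, b) and not less(b, a)
```

For instance, `(1,)` is a proper initial subsequence of `(1, 5)`, while `(1, 5)` and `(2,)` are incomparable under this reading.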

Here we use the fact that Tree-Description-Consistency can simulate the rooted triple
consistency problem by a simple reduction [BK07]. We construct $\Psi_k$ from $\Phi_k$ by replacing
each triple clause of $\Phi_k$ of the form $xy|z$ by the three conjuncts $u_{xyz} \| z$, $u_{xyz} < x$, and
$u_{xyz} < y$, where $u_{xyz}$ is a newly introduced variable. It can be shown (see [BK07]) that this
transformation preserves (un-)satisfiability, and thus $\Psi_k$ is unsatisfiable as well. Moreover,
the transformation is such that the girth of $\Psi_k$ is not smaller than the girth of $\Phi_k$. Finally, it
is clear that every conjunct in $\Psi_k$ has an injective satisfying assignment. Hence, Lemma 3.5
applies, and CSP($\Omega$) cannot be solved by Datalog.
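The replacement of triples by binary constraints is purely syntactic, as the following Python sketch illustrates. The representation is an assumption made for the example: a triple $xy|z$ is a tuple `(x, y, z)`, and the fresh variable $u_{xyz}$ is encoded as the tuple `("u", x, y, z)`.

```python
def triples_to_tree_description(triples):
    """Translate triples xy|z into constraints over < and ||.

    Each triple (x, y, z) yields u || z, u < x and u < y for a
    fresh variable u (encoded here, hypothetically, as a tagged tuple).
    Returns the pair (less_than, incomparable) of constraint lists.
    """
    less_than, incomparable = [], []
    for x, y, z in triples:
        u = ("u", x, y, z)          # one new variable per triple
        incomparable.append((u, z))  # u || z
        less_than.append((u, x))     # u < x
        less_than.append((u, y))     # u < y
    return less_than, incomparable
```

Each triple contributes exactly three binary conjuncts, so the size of the instance grows only by a constant factor.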



4. The Algorithm 

In this section we show that the rooted phylogeny problem can be solved in polynomial
time if all clauses come from the class $\mathcal{T}$ defined as follows.

Definition 4.1. A disjunction $\psi := x_1y_1|z_1 \vee \dots \vee x_py_p|z_p$ is called tame if it is trivial or
if $\{x_i, y_i\} = \{x_j, y_j\}$ for all $1 \leq i, j \leq p$. The set of all tame clauses is denoted by $\mathcal{T}$.

The algorithm we present in this section builds on previous algorithmic results about
the rooted triple consistency problem, most notably [ASSU81, HKW96]. One of the central
ideas for the polynomial-time algorithm for the rooted triple consistency problem
in [ASSU81] is to associate a certain undirected graph to an instance of the rooted triple
consistency problem. We generalize this idea to tame clauses as follows.

Definition 4.2. Let $\Phi$ be an instance of the rooted triple consistency problem with tame
clauses. Then $F_\Phi := (V, E)$ is the graph where the vertex set $V$ is the set of variables of $\Phi$,
and where $E$ contains an edge $\{x, y\}$ iff $\Phi$ contains a clause $xy|z_1 \vee \dots \vee xy|z_p$ for $p \geq 1$.

The following provides a sufficient (but not a necessary) condition for unsatisfiability 
of rooted triple formulas with tame clauses. 

Lemma 4.3. Let $\Phi$ be an instance of the rooted phylogeny problem with tame clauses. If
$F_\Phi$ is connected then $\Phi$ is unsatisfiable.

Proof. Let $V$ be the set of variables in $\Phi$. Suppose that there is a solution $(T, \alpha)$ for $\Phi$. Let
$r$ be the yca of $\alpha(V)$ in $T$ (where $\alpha(V)$ is the set of all leaves in the image of $V$ under $\alpha$).
It cannot be that all vertices in $\alpha(V)$ lie below the same child of $r$ in $T$, since otherwise
this child would be a common ancestor of $\alpha(V)$ that is younger than $r = \mathrm{yca}(\alpha(V))$, which
is impossible. Since the graph $F_\Phi$ is connected, there is an edge $\{x, y\}$ in $F_\Phi$ such that
$\alpha(x)$ and $\alpha(y)$ lie below different children of $r$ in $T$. Hence, there are $z_1, \dots, z_p \in V$ and a
clause $xy|z_1 \vee \dots \vee xy|z_p$ in $\Phi$. By assumption, the yca of $\alpha(x)$ and $\alpha(y)$, which is $r$, lies
strictly below the yca of $\alpha(x)$ and $\alpha(z_i)$ for some $1 \leq i \leq p$, a contradiction to the choice of $r$. □

To see that the condition is not necessary consider the following example. 

Example 4.4. The rooted triple formula $\Phi = (ab|c \wedge bc|a \wedge ab|d)$ is unsatisfiable since the
first two literals cannot simultaneously be satisfied. But the graph $F_\Phi$ is disconnected; it
has the two components $\{a, b, c\}$ and $\{d\}$.
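The components claimed in Example 4.4 can be checked mechanically. The sketch below is an illustration only: it assumes that a non-trivial tame clause $xy|z_1 \vee \dots \vee xy|z_p$ is represented as a tuple `(x, y, [z_1, ..., z_p])`, and it computes the connected components of $F_\Phi$ with a small union-find structure. The function name is hypothetical.

```python
def tame_graph_components(variables, clauses):
    """Connected components of the graph F_Phi from Definition 4.2.

    variables: iterable of variable names.
    clauses:   list of (x, y, zs) encoding xy|z_1 v ... v xy|z_p.
    """
    parent = {v: v for v in variables}

    def find(v):
        # Follow parent pointers to the representative, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _zs in clauses:
        parent[find(x)] = find(y)  # the clause contributes the edge {x, y}

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)


# The formula of Example 4.4: ab|c, bc|a, ab|d.
example = [("a", "b", ["c"]), ("b", "c", ["a"]), ("a", "b", ["d"])]
components = tame_graph_components("abcd", example)
```

Only the pair $\{x, y\}$ of each clause creates an edge, which is why the variable $d$ remains isolated although it occurs in the formula.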
Solve($\Phi$)

Input: A rooted triple formula $\Phi$ with variables $V$ and clauses from $\mathcal{T}$.
Output: 'yes' if $\Phi$ is satisfiable, 'no' otherwise.

If $\Phi$ is the empty conjunction then return 'yes'
If $F_\Phi$ is connected
    return 'no'
else
    Let $S$ be the vertices of a connected component of $F_\Phi$
    If Solve($\Phi[S]$) returns 'no' or Solve($\Phi[V \setminus S]$) returns 'no' then return 'no'
    else return 'yes'
    end if
end if



Figure 3: The algorithm for the rooted phylogeny problem for tame clauses. 
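The recursion of Figure 3 can be sketched in Python as follows. This is a naive illustration and not the $O(m \log^2 n)$ implementation discussed in Theorem 4.5: it recomputes a component by depth-first search in every call. It assumes that all clauses are non-trivial and tame, encoded as tuples `(x, y, zs)` for $xy|z_1 \vee \dots \vee xy|z_p$, and that every clause variable occurs in `variables`; the name `solve` is illustrative.

```python
def solve(variables, clauses):
    """Decide satisfiability of a conjunction of non-trivial tame clauses."""
    if not clauses:
        return True  # the empty conjunction is satisfiable

    # Build the graph F_Phi: one edge {x, y} per clause.
    adjacency = {v: set() for v in variables}
    for x, y, _zs in clauses:
        adjacency[x].add(y)
        adjacency[y].add(x)

    # Depth-first search for the component S of an arbitrary vertex.
    start = next(iter(adjacency))
    component, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in component:
                component.add(w)
                stack.append(w)

    if len(component) == len(adjacency):
        return False  # F_Phi is connected, so Lemma 4.3 applies

    rest = set(adjacency) - component

    def restrict(part):
        # Phi[part]: the clauses all of whose variables lie in part.
        return [c for c in clauses if {c[0], c[1], *c[2]} <= part]

    return solve(component, restrict(component)) and solve(rest, restrict(rest))
```

On the formula of Example 4.4 the call `solve({"a", "b", "c", "d"}, [("a", "b", ["c"]), ("b", "c", ["a"]), ("a", "b", ["d"])])` rejects, because the restriction to $\{a, b, c\}$ has a connected graph.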



Theorem 4.5. The algorithm Solve in Figure 3 determines whether a given instance $\Phi$ of
the rooted phylogeny problem for tame clauses is satisfiable. When $m$ is the number of triples
in all clauses, and $n$ is the number of variables of $\Phi$, then the algorithm can be implemented
to run in time $O(m \log^2 n)$.

Proof. If $\Phi$ is the empty conjunction, then $\Phi$ is clearly satisfiable, and so the answer of the
algorithm is correct in this case. The algorithm first computes a connected component $S$ of
$F_\Phi$ (we discuss details of this step in the paragraph about the running time of the algorithm);
if $S = V$, i.e., if $F_\Phi$ is connected, then Lemma 4.3 implies that $\Phi$ is unsatisfiable.

Otherwise, we execute the algorithm recursively on $\Phi[S]$ and on $\Phi[V \setminus S]$. If any of
these recursive calls reports an inconsistency, then $\Phi$ is clearly unsatisfiable as well: if
there were a solution $(T, \alpha)$ to $\Phi$, then restricting $\alpha$ to $S$ (respectively to $V \setminus S$) would
yield a solution to $\Phi[S]$ (respectively to $\Phi[V \setminus S]$). Otherwise, we
inductively assume that the algorithm correctly asserts the existence of a solution $(T_1, \alpha_1)$
of $\Phi[S]$ and of a solution $(T_2, \alpha_2)$ of $\Phi[V \setminus S]$.

Let $T$ be the tree obtained by creating a new vertex $r$, linking the roots of $T_1$ and
$T_2$ below $r$, and making $r$ the root of $T$. Let $\alpha$ be the mapping that maps $x$ to $\alpha_i(x)$ if
$x \in L(T_i)$, for $i \in \{1, 2\}$. We claim that $(T, \alpha)$ is a solution to $\Phi$, i.e., we have to show that
in every clause $\psi$ of $\Phi$ at least one literal is satisfied. If $\psi = (xy|z_1 \vee \dots \vee xy|z_p)$, then $x$
and $y$ are in the same subtree $T_i$ of $T$, since they are connected by an edge in $F_\Phi$. If all
variables of $\psi$ lie completely inside $S$ or completely inside $V \setminus S$, we are done by inductive
assumption, because $(T_1, \alpha_1)$ is a solution for $\Phi[S]$ and $(T_2, \alpha_2)$ is a solution for $\Phi[V \setminus S]$.
Otherwise, there must be a $j$, $1 \leq j \leq p$, such that $z_j$ lies in a different component than
$x$ and $y$. But in this case the yca of $\alpha(x)$ and $\alpha(y)$ lies strictly below $r$, which is the yca
of $\alpha(x)$ and $\alpha(z_j)$. Hence, the literal $xy|z_j$ in $\psi$ is satisfied. This concludes the correctness
proof of the algorithm shown in Figure 3.
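The joining step of the correctness proof can be illustrated on a small instance. In the sketch below, trees are nested Python tuples whose leaves are variable names (so $\alpha$ is the identity on names, an assumption made for the example), and a literal $xy|z$ is tested by comparing the depths of the relevant youngest common ancestors. All function names are illustrative.

```python
def leaf_paths(tree, prefix=()):
    """Map every leaf of a nested-tuple tree to its root-to-leaf path."""
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def yca_depth(p, q):
    """Depth of the youngest common ancestor of two leaves, given their paths."""
    depth = 0
    while depth < min(len(p), len(q)) and p[depth] == q[depth]:
        depth += 1
    return depth


def satisfies(tree, clauses):
    """Check clauses (x, y, zs), each encoding xy|z_1 v ... v xy|z_p."""
    paths = leaf_paths(tree)
    return all(
        # xy|z holds iff yca(x, y) lies strictly below yca(x, z).
        any(yca_depth(paths[x], paths[y]) > yca_depth(paths[x], paths[z]) for z in zs)
        for x, y, zs in clauses
    )


t1 = ("a", ("b", "c"))   # a solution for the part S = {a, b, c}
t2 = "d"                 # a solution for the part V \ S = {d}
joined = (t1, t2)        # new root with the two roots linked below it
clauses = [("a", "b", ["c", "d"]), ("b", "c", ["a"])]
```

In `joined`, the clause $ab|c \vee ab|d$ is satisfied through the literal $ab|d$, because $d$ lies in the other part and the yca of $a$ and $d$ is the new root.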

We still have to show how this procedure can be implemented such that the running
time is in $O(m \log^2 n)$. There are amortized sublinear algorithms for testing connectivity
in undirected graphs while removing the edges of the graph. This was used to speed
up the algorithm for the rooted triple consistency problem [HKW96]. At present, the
fastest known algorithm for this purpose appears to be the deterministic decremental graph
connectivity algorithm of Holm, de Lichtenberg, and Thorup [THdL98], which has a query
time in $O(\log n / \log \log n)$, and an update time in $O(\log^2 n)$. We can use the same approach
as in [HKW96] and obtain an $O(m \log^2 n)$ bound for the worst-case running time of our
algorithm. □



5. Complexity Classification 
This section is devoted to the proof of the following result. 

Theorem 5.1. Let $\mathcal{C}$ be a set of rooted triple clauses that contains clauses that are not tame
(Definition 4.1). Then the rooted phylogeny problem for clauses from $\mathcal{C}$ is NP-complete.



Our proof of Theorem 5.1 consists of two parts. In the first part, we show that if $\mathcal{C}$ is
not a subset of $\mathcal{T}$, then a certain Boolean split problem associated to $\mathcal{C}$ (defined below) is
NP-hard. In the second part we show that this Boolean split problem reduces to the rooted
phylogeny problem for $\mathcal{C}$.

Definition 5.2 (split formula for $\Phi$). Let $\Phi$ be a rooted triple formula. Then the split
formula for $\Phi$ is the Boolean formula obtained from $\Phi$ by replacing each literal $xy|z$ by
$(x \leftrightarrow y) \wedge (z \vee \neg z)$.

The purpose of the tautological second conjunct $z \vee \neg z$ is to introduce the variable $z$,
which would otherwise not appear in the formula; this becomes relevant in the following.
If $\mathcal{C}$ is a class of triple clauses, we define $\mathcal{B}(\mathcal{C})$ to be the set of split formulas for the clauses
from $\mathcal{C}$.

A solution to a propositional formula is called surjective if at least one variable is set to
true and at least one variable is set to false. The split problem for a set of Boolean formulas
$\mathcal{B}$ is the problem of deciding whether a given conjunction of formulas obtained from formulas
in $\mathcal{B}$ by variable substitution has a surjective solution.
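As an illustration of these two definitions, the following Python sketch evaluates the split formula of a single triple clause and tests surjectivity of an assignment. It assumes that a clause is a list of triples `(x, y, z)` read as $xy|z$ and that an assignment is a dictionary from variable names to 0 or 1; both function names are hypothetical.

```python
def split_clause_holds(clause, assignment):
    """Evaluate the split formula of a disjunction of triples xy|z.

    Each literal xy|z becomes (x <-> y) and (z or not z); the second
    conjunct is a tautology but forces z to be assigned a value.
    """
    return any(
        assignment[x] == assignment[y] and assignment[z] in (0, 1)
        for x, y, z in clause
    )


def is_surjective(assignment):
    """At least one variable is set to true and at least one to false."""
    values = set(assignment.values())
    return 0 in values and 1 in values
```

For the clause consisting only of $ab|c$, the assignment $a = b = 0$, $c = 1$ is a surjective solution, whereas $a = 0$, $b = c = 1$ violates $a \leftrightarrow b$.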

We will show that if $\mathcal{C}$ is a class of triple clauses that is not a subclass of $\mathcal{T}$, then
there exists a finite subset $\mathcal{C}'$ of $\mathcal{C}$ such that the split problem for $\mathcal{B}(\mathcal{C}')$ is NP-complete. In
the proof of this statement we use the following result, which follows from Theorem 6.12
in [CKS01], and is due to [CH97]. The notions of Horn, dual Horn, affine, and bijunctive
Boolean formulas are standard and introduced in detail in [CKS01]. Bijunctive formulas
are also known as 2-CNF formulas.

Theorem 5.3 (of [CH97]). Let $\mathcal{B}$ be a set of Boolean formulas. Then the split problem for
$\mathcal{B}$ is in P if all formulas in $\mathcal{B}$ are from one of the following types: Horn, dual Horn, affine,
bijunctive. In all other cases, $\mathcal{B}$ contains a finite subset $\mathcal{B}'$ such that the split problem for
$\mathcal{B}'$ is NP-complete.

We say that a Boolean formula $\psi$ is preserved by an operation $f : \{0, 1\}^k \to \{0, 1\}$ if for
all satisfying assignments $\alpha_1, \dots, \alpha_k$ of $\psi$ the mapping defined by $x \mapsto f(\alpha_1(x), \dots, \alpha_k(x))$
is also a satisfying assignment for $\psi$.

Proposition 5.4. If $\mathcal{C}$ is not a subclass of $\mathcal{T}$, then $\mathcal{B}(\mathcal{C})$ is neither Horn, dual Horn, affine,
nor bijunctive.

Proof. Let $\phi$ be a clause from $\mathcal{C} \setminus \mathcal{T}$. By construction the split formula $\psi$ for $\phi$ is preserved
by $x \mapsto \neg x$ and is also preserved by constant operations. Moreover, it is known (and follows
from [Pos41]) that every Boolean formula that is preserved by $\neg$, is satisfied by the constant
assignments, and is either Horn, dual Horn, affine, or bijunctive must also be preserved by the operation
xor defined as $(x, y) \mapsto (x + y \bmod 2)$. So it suffices to show that $\psi$ cannot be preserved
by xor.

Because $\phi$ is not from $\mathcal{T}$ and in particular non-trivial, there is a tree $T$ and an injective
mapping $\alpha$ from the variables $V$ of $\phi$ to the leaves of $T$ such that $(T, \alpha)$ is not a solution to
$\phi$. Moreover, since the clause $\phi$ is not tame, it must contain triples $ab|c$ and $uv|w$ where
$\{a, b\} \neq \{u, v\}$. Consider the assignment $\beta$ that maps $x \in V$ to 0 if $\alpha(x)$ is below the first
child of the yca of $\alpha(V)$ in $T$, and that maps $x$ to 1 otherwise (which child is selected as the
first child is not important in the proof). By construction, the assignment $\beta$ does not satisfy
the split formula $\psi$ for $\phi$, since $\phi$ is not satisfied by $(T, \alpha)$. Observe that the assignment $\beta_1$
that is obtained from $\beta$ by negating the value assigned to $a$ is a satisfying assignment for $\psi$,
since it satisfies the disjunct $((a \leftrightarrow b) \wedge (c \vee \neg c))$ of $\psi$. The assignment $\beta_2$ that is constant 0
except for the variable $a$ which is assigned 1 is also a satisfying assignment for $\psi$, because
$\beta_2$ satisfies $((u \leftrightarrow v) \wedge (w \vee \neg w))$. But since $\mathrm{xor}(\beta_1(x), \beta_2(x))$ equals $\beta(x)$ for all $x \in V$, this
shows that $\psi$ is not preserved by xor, which is what we wanted to show. □
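For small clauses, preservation by xor can be tested by brute force, which gives a concrete instance of Proposition 5.4. The sketch below is illustrative only: clauses are lists of triples `(x, y, z)` read as $xy|z$, the tautological conjuncts of the split formula are omitted because they do not affect the set of solutions, and the function names are hypothetical. The non-tame clause $ab|c \vee ac|b$ is an example chosen for this sketch, not one taken from the proof.

```python
from itertools import product


def split_solutions(clause, variables):
    """All 0/1 assignments (as tuples ordered like `variables`)
    that satisfy the split formula of the given triple clause."""
    solutions = set()
    for bits in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if any(assignment[x] == assignment[y] for x, y, _z in clause):
            solutions.add(bits)
    return solutions


def preserved_by_xor(clause, variables):
    """Check that the componentwise xor of any two solutions is a solution."""
    solutions = split_solutions(clause, variables)
    return all(
        tuple(p ^ q for p, q in zip(s, t)) in solutions
        for s in solutions
        for t in solutions
    )


tame = [("a", "b", "c"), ("a", "b", "d")]       # ab|c v ab|d
not_tame = [("a", "b", "c"), ("a", "c", "b")]   # ab|c v ac|b
```

For the tame clause the split formula is equivalent to $a \leftrightarrow b$, which is closed under xor. For the non-tame clause, the solutions $(0, 0, 1)$ and $(0, 1, 0)$ for $(a, b, c)$ have xor $(0, 1, 1)$, which satisfies neither $a \leftrightarrow b$ nor $a \leftrightarrow c$.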



We now turn to the second part of the proof of Theorem 5.1. The idea to reduce the
split problem for $\mathcal{B}(\mathcal{C})$ to the rooted phylogeny problem for clauses from $\mathcal{C}$ is to construct
instances $\Phi$ of the phylogeny problem for $\mathcal{C}$ in such a way that $\Phi$ is satisfiable if and only
if $\mathcal{B}(\Phi)$ has a surjective solution. To implement this idea, we construct an instance $\Phi$ of
the phylogeny problem that fragments into simple and satisfiable pieces if $\mathcal{B}(\Phi)$ has a
surjective solution.

Proposition 5.5. Let $\mathcal{C}$ be a finite class of triple clauses. Then the split problem for $\mathcal{B}(\mathcal{C})$
can be reduced in polynomial time to the rooted phylogeny problem for clauses from $\mathcal{C}$.

Proof. Note that the split formula for a trivial clause is a tautological Boolean formula.
Hence, if all clauses in $\mathcal{C}$ are trivial, then the split problem for $\mathcal{B}(\mathcal{C})$ is clearly in P and there
is nothing to show. Otherwise, we can assume that $\mathcal{C}$ contains the clause that just consists
of $ab|c$ since this clause can be simulated by non-trivial clauses from $\mathcal{C}$ by appropriately
equating variables (Lemma 2.10).



Suppose we are given an instance of the split problem for $\mathcal{B}(\mathcal{C})$ with clauses $\psi_1, \dots, \psi_m$
and variables $V = \{x_0, \dots, x_{n-1}\}$. We create an instance $\Phi$ of the rooted phylogeny problem
for $\mathcal{C}$ as follows. The variables $U$ of $\Phi$ are triples $(x, i, j)$ where $x \in V$, $i \in \{0, \dots, m - 1\}$,
and $j \in \{1, \dots, n - 1\}$. In the following, all indices of variables from $V$ are modulo $n$.
Moreover, if $m > 1$ we will also write $(x, i, n)$ for $(x, i + 1, 1)$ for all $i \in \{0, \dots, m - 2\}$. The
clauses of $\Phi$ consist of two groups, $\Phi_1$ and $\Phi_2$.

• To define the first group $\Phi_1$ of clauses, suppose that $\psi_i$ has variables $y_1, \dots, y_q$. Let
$\phi_i(y_1, \dots, y_q)$ be the triple clause that defines the Boolean relation from $\mathcal{B}(\mathcal{C})$ used in
$\psi_i(y_1, \dots, y_q)$. By the assumption that $\mathcal{C}$ and $\mathcal{B}(\mathcal{C})$ are finite it is clear that $\phi_i$ can be
computed efficiently (in constant time). We then add the clause $\phi_i((y_1, i, 1), \dots, (y_q, i, 1))$
to $\Phi_1$.

• The second group $\Phi_2$ of clauses has for all $x_s \in V$, $i \in \{0, \dots, m - 2\}$ (if $m = 1$ the second
group of clauses is empty), and $j \in \{1, \dots, n - 1\}$ the clause

$$(x_s, i, j)(x_s, i, j + 1)|(x_{s+j}, i, 1) .$$

Note that $\Phi_2$ only consists of rooted triples, and therefore $F_{\Phi_2}$ is defined, and consists of
exactly $n$ paths of length $(n - 1)(m - 1)$.
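The second group of clauses is easy to generate explicitly, which also makes the counting transparent. The following Python sketch is an illustration under stated assumptions: a variable $(x_s, i, j)$ is encoded as the tuple `(s, i, j)` with `s` taken modulo $n$, the identification of $(x, i, n)$ with $(x, i + 1, 1)$ is applied when variables are created, and a triple $xy|z$ is a tuple `(x, y, z)`. The function name `second_group` is hypothetical.

```python
def second_group(n, m):
    """Clauses of the second group for n Boolean variables and m split clauses."""

    def var(s, i, j):
        # (x, i, n) is another name for (x, i + 1, 1).
        if j == n:
            i, j = i + 1, 1
        return (s % n, i, j)  # indices of variables from V are modulo n

    clauses = []
    for s in range(n):
        for i in range(m - 1):          # i in {0, ..., m - 2}; empty if m = 1
            for j in range(1, n):       # j in {1, ..., n - 1}
                clauses.append((var(s, i, j), var(s, i, j + 1), var(s + j, i, 1)))
    return clauses


def path_edges(clauses, s):
    """Edges {x, y} of the graph of the second group whose endpoints
    have first coordinate s."""
    return {frozenset((x, y)) for x, y, _z in clauses if x[0] == s}
```

For example, with $n = 3$ and $m = 3$ the function returns $n(n - 1)(m - 1) = 12$ clauses, and for each of the three values of `s` the helper `path_edges` yields $(n - 1)(m - 1) = 4$ distinct edges, matching the stated path length.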

We claim that $\Phi$ is satisfiable if and only if $\psi_1 \wedge \dots \wedge \psi_m$ has a surjective solution. First
suppose that $\Phi$ has a solution $(T, \alpha)$. Then the variables $U$ of $\Phi$ can be partitioned into the
variables that are mapped via $\alpha$ below the left child of $\mathrm{yca}(\alpha(U))$, and the ones mapped
below the right child. Note that both parts of the partition are non-empty. Variables $(x, i, j)$
of $U$ that share the first coordinate are in the same part of the partition due to the clauses
in $\Phi$ in the second group. Hence, the mapping that sends $x \in V$ to 0 if $(x, i, j)$ is mapped to
the first part, and that sends $x$ to 1 otherwise is well-defined, and a surjective assignment.
It also satisfies all clauses $\psi_1, \dots, \psi_m$, because of the first group of clauses in $\Phi$.

Conversely, suppose that there is a surjective solution $s$ for $\psi_1 \wedge \dots \wedge \psi_m$. Let $S$ be the
subset of the variables $V$ assigned to 0 by $s$, and consider the instances $\Phi_l := \Phi[S]$
and $\Phi_r := \Phi[V \setminus S]$. Since the assignment is surjective, there is a variable $x_p \in V$ that is
mapped to 1 and a variable $x_q \in V$ that is mapped to 0. Hence, for all $i \in \{0, \dots, m - 1\}$
the clauses $(x_p, i, q - p)(x_p, i, q - p + 1)|(x_q, i, 1)$ from the second group are neither in $\Phi_l$
nor in $\Phi_r$, because they contain variables from both parts of the partition. Therefore, any
clause from the first group in $\Phi_l$ will be disconnected in the incidence graph $G(\Phi_l)$ from any
other clause in the first group in $\Phi_l$. Since each clause from the first group is satisfiable, it
is easy to see that $\Phi_l$ has a solution $(T_l, \alpha_l)$. The same statement holds for $\Phi_r$; let $(T_r, \alpha_r)$
be a solution for $\Phi_r$. Let $T$ be the rooted tree obtained from $T_l$ and $T_r$ by creating a new
vertex $t$, linking the roots of $T_l$ and $T_r$ below $t$, and making $t$ the root of $T$. Let $\alpha$ be the
common extension of $\alpha_l$ and $\alpha_r$ to all of $U$. Then $(T, \alpha)$ is clearly a solution to $\Phi$.

Both groups of clauses together consist of $m + n(n - 1)(m - 1)$ many clauses, and it is
easy to see that the reduction can be implemented in polynomial time. □

We conclude this section with a combination of the results above. 



Proof of Theorem 5.1. As mentioned, the rooted phylogeny problem for $\mathcal{C}$ is clearly in NP.
Let $\mathcal{C}$ be a class of triple clauses that is not a subset of $\mathcal{T}$. We prove NP-hardness as follows.
By Proposition 5.4, $\mathcal{B}(\mathcal{C})$ is neither Horn, dual Horn, affine, nor bijunctive. Theorem 5.3
asserts that there exists a finite subset $\mathcal{B}'$ of $\mathcal{B}(\mathcal{C})$ such that the split problem for $\mathcal{B}'$ is
NP-hard. This means that there is a finite subset $\mathcal{C}'$ of $\mathcal{C}$ such that the split problem for $\mathcal{B}(\mathcal{C}')$ is
NP-hard. Proposition 5.5 shows that the rooted phylogeny problem for clauses from $\mathcal{C}'$ (and
hence also for clauses from $\mathcal{C}$) is NP-hard as well. □



6. Concluding Remarks 

We have shown that consistency of rooted phylogeny data can be decided in polynomial 
time when the data consists of tame disjunctions of rooted triples. Our algorithm extends 
previous algorithmic results about the rooted triple consistency problem, without sacrificing 
worst-case efficiency. The class $\mathcal{T}$ of tame triple clauses that can be handled efficiently is
also motivated by another result of this paper, which states that any set of triple clauses
that is not contained in $\mathcal{T}$ has an NP-complete rooted phylogeny problem. Here we use
known results about the complexity of surjective Boolean constraint satisfaction problems. 

We also show that no Datalog program can solve the rooted triple consistency problem,
using a pebble game that captures the expressive power of Datalog for constraint satisfaction
problems with infinite $\omega$-categorical structures. In fact, our result follows from a
more general result that also applies to many constraint satisfaction problems outside of
phylogenetic reconstruction. We show that a constraint satisfaction problem for a structure
with a large automorphism group cannot be solved by Datalog if, roughly, for all $k$ there
exists an unsatisfiable instance of girth at least $k$.
The class of phylogeny problems studied in this paper has a natural generalization to
a larger class of computational problems, namely problems of the form CSP($\Gamma$) where $\Gamma$
has a first-order definition in $\Lambda$, the $\omega$-categorical relatively 3-transitive C-set introduced
in Section 3. This class contains several additional problems that have been studied in
phylogenetic reconstruction, for instance the quartet consistency problem [Ste92]. The
larger class also contains new problems that can be solved in polynomial time, and where
the split problem consists in finding surjective solutions to Boolean linear equation systems.
A complexity classification for this larger class of computational problems remains open and
is left for future research.